Add qwen3.5-fp4-b200-trt-mtp single-node TensorRT-LLM benchmark by RohitNagraj · Pull Request #1894 · SemiAnalysisAI/InferenceX

RohitNagraj · 2026-06-23T04:35:52Z

Adds the qwen3.5-fp4-b200-trt-mtp config — Qwen3.5-397B-A17B-NVFP4 on B200, single-node TensorRT-LLM with MTP speculative decode — for the 1k/1k and 8k/1k cells with a TP/TEP/DEP parallelism sweep.

nvidia-master.yaml: new config entry + MTP search space.
qwen3.5_fp4_b200_trt_mtp.sh: trtllm-serve benchmark script; generates the extra-llm-api config (MoE backend, attention-DP / batch-wait settings, MTP speculative config) per parallelism mode.
perf-changelog entry.

Note

Low Risk
Benchmark-only wiring (YAML config, launch script, changelog); no production inference, auth, or data-path changes.

Overview
Adds qwen3.5-fp4-b200-trt-mtp so Qwen3.5-397B-A17B-NVFP4 on B200 can be measured with single-node TensorRT-LLM and MTP speculative decode, alongside the existing non-MTP qwen3.5-fp4-b200-trt entry.

nvidia-master.yaml registers the config on tensorrt-llm/release:1.3.0rc18 with 1k/1k and 8k/1k fixed-seq-len cells and a TP / EP / attention-DP search space where every point sets spec-decoding: "mtp".

qwen3.5_fp4_b200_trt_mtp.sh drives trtllm-serve (pytorch backend): disables FlashInfer GDN prefill for MTP, writes qwen3.5-fp4-trt-mtp.yml with MTP (num_nextn_predict_layers: 3), CUTEDSL vs TRTLLM MoE and KV / batch-wait tuning keyed off DP attention and ISL/TP/EP, then runs the standard serving benchmark (optional lm-eval).

perf-changelog.yaml documents the new config key.
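A minimal sketch of the per-mode config generation described for qwen3.5_fp4_b200_trt_mtp.sh, under stated assumptions. The MoE backend rule below is hypothetical (the PR's actual rule also keys off ISL, TP and EP and is not shown on this page). The key names follow TensorRT-LLM's extra-llm-api options as commonly used and should be checked against the 1.3.0rc18 release. A temporary file stands in for qwen3.5-fp4-trt-mtp.yml.

```shell
#!/bin/sh
# Illustrative input; the real script receives this from the sweep harness.
DP_ATTENTION=false

# Hypothetical selection rule, NOT the PR's actual logic (which also keys
# off ISL, TP and EP): pick one MoE backend per attention-DP mode.
if [ "$DP_ATTENTION" = "true" ]; then
    MOE_BACKEND="CUTEDSL"
else
    MOE_BACKEND="TRTLLM"
fi

# Write to a temporary file so the sketch does not clobber a real config.
CONFIG_FILE="$(mktemp)"
cat > "$CONFIG_FILE" <<EOF
enable_attention_dp: $DP_ATTENTION
moe_config:
  backend: $MOE_BACKEND
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
EOF

cat "$CONFIG_FILE"
```

The generated file would then be handed to trtllm-serve as its extra LLM API options file; KV-cache and batch-wait settings are omitted here because their values are not given on this page.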

Reviewed by Cursor Bugbot for commit 7649ae1. Bugbot is set up for automated code reviews on this repo.

Add the qwen3.5-fp4-b200-trt-mtp config (Qwen3.5-397B-A17B-NVFP4, B200, 1k/1k and 8k/1k) with MTP speculative decode across a TP/TEP/DEP parallelism sweep, the qwen3.5_fp4_b200_trt_mtp.sh benchmark script, and a perf-changelog entry.

# Conflicts: # perf-changelog.yaml

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-06-23T04:42:47Z

CUDA graph sizes exceed max batch

Medium Severity

The extra LLM config hardcodes cuda_graph_config.batch_sizes through 128, while trtllm-serve gets --max_batch_size from CONC or CONC/8 (often 4–16 in this recipe). Peer Qwen and TRT-MTP scripts tie CUDA graph capture to MAX_BATCH_SIZE via max_batch_size, so graph warmup can overshoot the runtime batch cap and risk validation failures or excess memory use on low-concurrency jobs.
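One possible fix, sketched under assumptions: derive the capture list from the runtime cap instead of hardcoding it through 128. The helper name graph_batch_sizes and the power-of-two ladder are illustrative and not taken from the PR; MAX_BATCH_SIZE stands for whatever value the script passes to --max_batch_size.

```shell
#!/bin/sh
# Hypothetical helper: print power-of-two CUDA graph batch sizes that do not
# exceed the runtime batch cap passed as $1, always ending with the cap.
graph_batch_sizes() {
    max="$1"
    size=1
    out=""
    while [ "$size" -lt "$max" ]; do
        out="${out:+$out }$size"
        size=$(( size * 2 ))
    done
    # Append the cap so the largest captured graph matches the largest batch.
    out="${out:+$out }$max"
    printf '%s\n' "$out"
}

# Example: a low-concurrency job with a cap of 16.
MAX_BATCH_SIZE=16
graph_batch_sizes "$MAX_BATCH_SIZE"
```

The printed list could be written into cuda_graph_config.batch_sizes. The peer scripts mentioned above avoid the list entirely by setting max_batch_size, which may be the simpler change.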


functionstackx · 2026-06-23T05:02:26Z

+run_benchmark_serving \
+    --model "$MODEL" \
+    --port "$PORT" \
+    --backend openai \
+    --input-len "$ISL" \
+    --output-len "$OSL" \
+    --random-range-ratio "$RANDOM_RANGE_RATIO" \
+    --num-prompts "$(( CONC * 10 ))" \
+    --max-concurrency "$CONC" \
+    --result-filename "$RESULT_FILENAME" \
+    --result-dir /workspace/


Missing --use-chat-template.

Thanks for catching it!

github-actions · 2026-06-23T07:33:51Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28002602936
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28002602936

github-actions · 2026-06-23T18:47:18Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28002602936
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28002602936


MTP runs need --use-chat-template on run_benchmark_serving for meaningful acceptance, matching the other single-node MTP scripts.
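A sketch of the corrected call, reusing the arguments from the reviewed hunk with the flag appended. run_benchmark_serving is the repository's helper; the stub below only echoes its arguments so the snippet runs standalone, and every variable value is a placeholder rather than a value from the PR.

```shell
#!/bin/sh
# Stand-in for the repository's helper so the sketch runs on its own;
# it only echoes the arguments it receives.
run_benchmark_serving() {
    printf '%s\n' "$*"
}

# Placeholder values; the benchmark harness normally provides these.
MODEL="placeholder-model"
PORT=8000
ISL=1024
OSL=1024
RANDOM_RANGE_RATIO=1.0
CONC=16
RESULT_FILENAME="placeholder-result"

# Same arguments as the reviewed hunk, plus --use-chat-template.
cmd_line="$(run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend openai \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$(( CONC * 10 ))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template)"
printf '%s\n' "$cmd_line"
```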

github-actions · 2026-06-23T22:47:31Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28051750810
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28051750810

github-actions · 2026-06-24T02:59:52Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28051750810
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=28051750810

Ankur-singh · 2026-06-24T04:27:07Z

As a PR reviewer and CODEOWNER, I have reviewed this and have:

Verified that the general code quality meets the InferenceX standard and does not make the code quality any worse.
Verified that this PR has passed PR validation.
Verified that this PR passes evals.
If a company claims that they support vLLM/SGLang as first-class LLM inference engines on their hardware, I have verified that the respective vLLM/SGLang submission has been made before additional frameworks (TRT-LLM, ATOM, etc.). The only exceptions are for new hardware, such as MI455X UALoE72, Vera Rubin NVL72, Rubin NVL8, etc., and for new model architectures where there is an actual reason why vLLM/SGLang does not fundamentally support them yet.
Verified that the single-node recipes are similar to the official vLLM recipes and/or the SGLang cookbook:
- If they are not, I have verified that a PR has been opened in vLLM recipe repo or SGLang repo and linked it below in the additional detail section:
If any of the above criteria cannot reasonably be satisfied, I have provided additional reasoning below.

Additional detail section:

This is a TRTLLM config, hence no recipe required

Signed: ankur-singh

Klaud-Cold · 2026-06-24T04:29:21Z

@Ankur-singh Blocks merge: Check 3 fails — the sign-off's Additional detail section has no recipe link (only "This is a TRTLLM config"); this workflow requires a link even for a TRT-LLM config. Open/link a recipe (vllm-project/recipes or sglang cookbook) or the published recipe page.

Check 0 (PASS): nvidia-master.yaml is owned by @Ankur-singh @kedarpotdar-nv @jgangani (signer listed); the .sh and perf-changelog fall to the catch-all * @InferenceX/core, covered.
Check 1 (PASS): in-PR head ded5975 has green, non-skipped single-node 1k1k/8k1k and eval runs: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/28051750810
Check 2 (PASS): gsm8k em_strict ~0.969 across configs; image tensorrt-llm/release:1.3.0rc18 matches the PR config.
Check 3 (FAIL): no recipe link present in the sign-off. Major server args (model/TP/EP/attention-DP/MTP/MoE backend/kv-cache fp8) cannot be checked without a linked recipe; a bare claim does not satisfy the standard.

Oseltamivir · 2026-06-24T04:58:21Z

/reuse-sweep-run

RohitNagraj requested a review from a team June 23, 2026 04:35

RohitNagraj requested review from Ankur-singh, jgangani and kedarpotdar-nv as code owners June 23, 2026 04:35

github-project-automation Bot added this to InferenceMAX Board Jun 23, 2026

RohitNagraj added 2 commits June 22, 2026 21:36

Update perf-changelog pr-link for #1894

ac7d261

Merge remote-tracking branch 'origin/main' into qwen3.5-fp4-b200-trt-mtp

7649ae1

# Conflicts: # perf-changelog.yaml

RohitNagraj added the full-sweep-enabled label Jun 23, 2026

cursor Bot reviewed Jun 23, 2026


functionstackx reviewed Jun 23, 2026


RohitNagraj added 2 commits June 23, 2026 12:24

Merge remote-tracking branch 'origin/main' into pr-1894-reuse-93299

dfcea05

# Conflicts: # perf-changelog.yaml

Enable chat template for qwen3.5 fp4 b200 trt MTP benchmark

ded5975


Ankur-singh approved these changes Jun 24, 2026


RohitNagraj requested a review from functionstackx June 24, 2026 04:44

Merge branch 'main' into qwen3.5-fp4-b200-trt-mtp

503f558

Oseltamivir merged commit 07cdcfb into main Jun 24, 2026
26 checks passed

Oseltamivir deleted the qwen3.5-fp4-b200-trt-mtp branch June 24, 2026 05:04

github-project-automation Bot moved this to Done in InferenceMAX Board Jun 24, 2026

claude Bot mentioned this pull request Jun 24, 2026

[AMD] Add MiniMax-M3-FP8 MI355X ATOM EAGLE3 / non-EAGLE3 update 0623 #1916

Open

8 tasks

+                  - 16
+                  - 32
+                  - 64
+                  - 128

Oseltamivir merged 6 commits into main from qwen3.5-fp4-b200-trt-mtp

5 participants
